Assignment completed by Carlos Sánchez Polo and Jesús Martínez Leal¶
Last edited: 01/03/2024
- Initial example
- Exercises with OpenTSNE
- Load data
- - Start by running the function with the default values (perplexity = 30, early_exaggeration = 12, initialization = 'pca') on a training subset of 75% of the sample.
- - Run the model without early_exaggeration (early_exaggeration = 1). What differences do you observe and what causes them?
- - Run the model with the default values but changing the initialization to random. What happens? Do we get better or worse results than in the previous case? Also compare the execution times and explain why they differ.
- - Run the model with two very different perplexity values, for example 1 and 100 (with the remaining values at their defaults), and comment on the results.
- - Of all the t-SNE configurations tried in the previous exercises, pick the one that gives the best results and apply the test data to its embedding. Plot the whole dataset.
- Exercises with sklearn's TSNE
Download the utils file from https://github.com/pavlin-policar/openTSNE/blob/master/examples/utils.py
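A minimal sketch of how the file could be fetched; the resources/ folder name is an assumption chosen so that the `from resources import utils` import below resolves.
import os
import urllib.request

# Hypothetical download step: save the helper module under resources/
os.makedirs("resources", exist_ok=True)
urllib.request.urlretrieve(
    "https://raw.githubusercontent.com/pavlin-policar/openTSNE/master/examples/utils.py",
    "resources/utils.py",
)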
from openTSNE import TSNE
from resources import utils
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
Load data¶
This example uses the Macosko 2015 dataset, which contains mouse retina data. It is a well-known dataset that has been explored extensively in the literature. It can be obtained from the following link: http://file.biolab.si/opentsne/macosko_2015.pkl.gz
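A sketch of the download step, assuming the data/ directory used by the loading code below.
import os
import urllib.request

# Hypothetical download step matching the data/ path used in the next cell
os.makedirs("data", exist_ok=True)
urllib.request.urlretrieve(
    "http://file.biolab.si/opentsne/macosko_2015.pkl.gz",
    "data/macosko_2015.pkl.gz",
)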
import gzip
import pickle
with gzip.open("data/macosko_2015.pkl.gz", "rb") as f:
    data = pickle.load(f)
x = data["pca_50"]
y = data["CellType1"].astype(str)
print("Data set contains %d samples with %d features" % x.shape)
Data set contains 44808 samples with 50 features
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.33, random_state=42)
print("%d training samples" % x_train.shape[0])
print("%d test samples" % x_test.shape[0])
30021 training samples
14787 test samples
tsne = TSNE(
perplexity=30,
metric="euclidean",
n_jobs=-1,
random_state=42,
verbose=True,
)
%time embedding_train = tsne.fit(x_train)
utils.plot(embedding_train, y_train, colors=utils.MACOSKO_COLORS)
--------------------------------------------------------------------------------
TSNE(early_exaggeration=12, n_jobs=-1, random_state=42, verbose=True)
--------------------------------------------------------------------------------
===> Finding 90 nearest neighbors using Annoy approximate search using euclidean distance...
--> Time elapsed: 2.80 seconds
===> Calculating affinity matrix...
--> Time elapsed: 0.57 seconds
===> Calculating PCA-based initialization...
--> Time elapsed: 0.08 seconds
===> Running optimization with exaggeration=12.00, lr=2501.75 for 250 iterations...
Iteration 50, KL divergence 5.1602, 50 iterations in 1.1982 sec
Iteration 100, KL divergence 5.1000, 50 iterations in 1.2712 sec
Iteration 150, KL divergence 5.0648, 50 iterations in 1.2973 sec
Iteration 200, KL divergence 5.0503, 50 iterations in 1.3088 sec
Iteration 250, KL divergence 5.0422, 50 iterations in 1.3080 sec
--> Time elapsed: 6.38 seconds
===> Running optimization with exaggeration=1.00, lr=30021.00 for 500 iterations...
Iteration 50, KL divergence 3.0021, 50 iterations in 1.3138 sec
Iteration 100, KL divergence 2.7919, 50 iterations in 2.3479 sec
Iteration 150, KL divergence 2.6944, 50 iterations in 3.5985 sec
Iteration 200, KL divergence 2.6360, 50 iterations in 4.9340 sec
Iteration 250, KL divergence 2.5954, 50 iterations in 5.9422 sec
Iteration 300, KL divergence 2.5646, 50 iterations in 7.0011 sec
Iteration 350, KL divergence 2.5405, 50 iterations in 7.9974 sec
Iteration 400, KL divergence 2.5218, 50 iterations in 8.5782 sec
Iteration 450, KL divergence 2.5051, 50 iterations in 9.5600 sec
Iteration 500, KL divergence 2.4925, 50 iterations in 10.5547 sec
--> Time elapsed: 61.83 seconds
CPU times: total: 13min 5s
Wall time: 1min 11s
Transformation¶
openTSNE is currently the only library that supports adding new points to an existing embedding.
%time embedding_test = embedding_train.transform(x_test)
utils.plot(embedding_test, y_test, colors=utils.MACOSKO_COLORS)
===> Finding 15 nearest neighbors in existing embedding using Annoy approximate search...
--> Time elapsed: 0.76 seconds
===> Calculating affinity matrix...
--> Time elapsed: 0.05 seconds
===> Running optimization with exaggeration=4.00, lr=0.10 for 0 iterations...
--> Time elapsed: 0.00 seconds
===> Running optimization with exaggeration=1.50, lr=0.10 for 250 iterations...
Iteration 50, KL divergence 213893.9378, 50 iterations in 0.2370 sec
Iteration 100, KL divergence 212358.2086, 50 iterations in 0.2640 sec
Iteration 150, KL divergence 211368.8011, 50 iterations in 0.2532 sec
Iteration 200, KL divergence 210642.9236, 50 iterations in 0.2411 sec
Iteration 250, KL divergence 210092.0278, 50 iterations in 0.2760 sec
--> Time elapsed: 1.27 seconds
CPU times: total: 18 s
Wall time: 2.45 s
All together¶
Overlay the transformed points on the original embedding.
fig, ax = plt.subplots(figsize=(8, 8))
utils.plot(embedding_train, y_train, colors=utils.MACOSKO_COLORS, alpha=0.25, ax=ax)
utils.plot(embedding_test, y_test, colors=utils.MACOSKO_COLORS, alpha=0.75, ax=ax)
Exercises with OpenTSNE¶
Apply the t-SNE model to the MNIST dataset.
Load data¶
Load MNIST dataset: https://www.kaggle.com/weiouyang/test-dataset/version/1
import gzip
import pickle
import sys
import matplotlib.pyplot as plt
f = gzip.open('data/mnist.pkl.gz', 'rb')
if sys.version_info < (3,):
    (X_train0, y_train0), (X_test0, y_test0) = pickle.load(f)
else:
    (X_train0, y_train0), (X_test0, y_test0) = pickle.load(f, encoding="bytes")
f.close()
print(X_train0.shape)
print(y_train0.shape)
for i in range(9):
    plt.subplot(330 + 1 + i)
    plt.imshow(X_train0[i], cmap=plt.get_cmap('gray'))
(60000, 28, 28)
(60000,)
X_train0=X_train0.reshape(60000,-1)
y_train0 = y_train0.astype(str)
X_test0 = X_test0.reshape(10000,-1)
y_test0 = y_test0.astype(str)
x_train, x_test, y_train, y_test = train_test_split(X_train0, y_train0, test_size=.25, random_state=42)
#X_train0 = X_train0[:10000]
#y_train = y_train[:10000]
#X_test0 = X_test0[:10000]
#y_test = y_test[:10000]
tsne = TSNE(perplexity = 30, metric = 'euclidean', early_exaggeration = 12, random_state = 42, n_jobs = -1, verbose = True, initialization = 'pca')
embedding_train_default = tsne.fit(x_train)
--------------------------------------------------------------------------------
TSNE(early_exaggeration=12, n_jobs=-1, random_state=42, verbose=True)
--------------------------------------------------------------------------------
===> Finding 90 nearest neighbors using Annoy approximate search using euclidean distance...
--> Time elapsed: 14.91 seconds
===> Calculating affinity matrix...
--> Time elapsed: 1.93 seconds
===> Calculating PCA-based initialization...
--> Time elapsed: 0.63 seconds
===> Running optimization with exaggeration=12.00, lr=3750.00 for 250 iterations...
Iteration 50, KL divergence 5.6614, 50 iterations in 1.4180 sec
Iteration 100, KL divergence 5.5525, 50 iterations in 1.4801 sec
Iteration 150, KL divergence 5.5307, 50 iterations in 1.4489 sec
Iteration 200, KL divergence 5.5215, 50 iterations in 1.4697 sec
Iteration 250, KL divergence 5.5154, 50 iterations in 1.4300 sec
--> Time elapsed: 7.25 seconds
===> Running optimization with exaggeration=1.00, lr=45000.00 for 500 iterations...
Iteration 50, KL divergence 3.2404, 50 iterations in 1.8101 sec
Iteration 100, KL divergence 2.9987, 50 iterations in 3.2319 sec
Iteration 150, KL divergence 2.8767, 50 iterations in 4.7148 sec
Iteration 200, KL divergence 2.7982, 50 iterations in 5.8753 sec
Iteration 250, KL divergence 2.7418, 50 iterations in 6.8423 sec
Iteration 300, KL divergence 2.6983, 50 iterations in 8.1364 sec
Iteration 350, KL divergence 2.6637, 50 iterations in 9.0462 sec
Iteration 400, KL divergence 2.6351, 50 iterations in 10.1843 sec
Iteration 450, KL divergence 2.6107, 50 iterations in 10.9289 sec
Iteration 500, KL divergence 2.5903, 50 iterations in 11.7919 sec
--> Time elapsed: 72.57 seconds
utils.plot(embedding_train_default, y_train)
embedding_test_default = embedding_train_default.transform(x_test)
utils.plot(embedding_test_default, y_test)
===> Finding 15 nearest neighbors in existing embedding using Annoy approximate search...
--> Time elapsed: 1.94 seconds
===> Calculating affinity matrix...
--> Time elapsed: 0.12 seconds
===> Running optimization with exaggeration=4.00, lr=0.10 for 0 iterations...
--> Time elapsed: 0.00 seconds
===> Running optimization with exaggeration=1.50, lr=0.10 for 250 iterations...
Iteration 50, KL divergence 216701.8935, 50 iterations in 0.2660 sec
Iteration 100, KL divergence 214971.9142, 50 iterations in 0.2700 sec
Iteration 150, KL divergence 213923.0646, 50 iterations in 0.2721 sec
Iteration 200, KL divergence 213234.8219, 50 iterations in 0.2690 sec
Iteration 250, KL divergence 212728.6054, 50 iterations in 0.2780 sec
--> Time elapsed: 1.36 seconds
We can plot both combined:
fig, ax = plt.subplots(figsize=(8, 8))
utils.plot(embedding_train_default, y_train, alpha=0.25, ax=ax)
utils.plot(embedding_test_default, y_test, alpha=0.75, ax=ax)
ax.set_title('perplexity = 30, early_exaggeration = 12')
Text(0.5, 1.0, 'perplexity = 30, early_exaggeration = 12')
tsne2 = TSNE(perplexity = 30, metric = 'euclidean', early_exaggeration = 1, random_state = 42, n_jobs = 8, verbose = True, initialization = 'pca')
embedding_train_default2 = tsne2.fit(x_train)
embedding_test_default2 = embedding_train_default2.transform(x_test)
--------------------------------------------------------------------------------
TSNE(early_exaggeration=1, n_jobs=8, random_state=42, verbose=True)
--------------------------------------------------------------------------------
===> Finding 90 nearest neighbors using Annoy approximate search using euclidean distance...
--> Time elapsed: 16.08 seconds
===> Calculating affinity matrix...
--> Time elapsed: 0.47 seconds
===> Calculating PCA-based initialization...
--> Time elapsed: 0.63 seconds
===> Running optimization with exaggeration=1.00, lr=45000.00 for 250 iterations...
Iteration 50, KL divergence 3.4339, 50 iterations in 1.4179 sec
Iteration 100, KL divergence 3.1556, 50 iterations in 2.5063 sec
Iteration 150, KL divergence 3.0148, 50 iterations in 3.6378 sec
Iteration 200, KL divergence 2.9265, 50 iterations in 4.9491 sec
Iteration 250, KL divergence 2.8638, 50 iterations in 5.9922 sec
--> Time elapsed: 18.50 seconds
===> Running optimization with exaggeration=1.00, lr=45000.00 for 500 iterations...
Iteration 50, KL divergence 2.8159, 50 iterations in 7.2589 sec
Iteration 100, KL divergence 2.7786, 50 iterations in 7.8043 sec
Iteration 150, KL divergence 2.7476, 50 iterations in 9.1626 sec
Iteration 200, KL divergence 2.7220, 50 iterations in 10.5171 sec
Iteration 250, KL divergence 2.7004, 50 iterations in 11.3700 sec
Iteration 300, KL divergence 2.6817, 50 iterations in 12.9160 sec
Iteration 350, KL divergence 2.6646, 50 iterations in 13.2518 sec
Iteration 400, KL divergence 2.6497, 50 iterations in 14.3092 sec
Iteration 450, KL divergence 2.6365, 50 iterations in 16.1026 sec
Iteration 500, KL divergence 2.6243, 50 iterations in 16.4693 sec
--> Time elapsed: 119.16 seconds
===> Finding 15 nearest neighbors in existing embedding using Annoy approximate search...
--> Time elapsed: 1.91 seconds
===> Calculating affinity matrix...
--> Time elapsed: 0.02 seconds
===> Running optimization with exaggeration=4.00, lr=0.10 for 0 iterations...
--> Time elapsed: 0.00 seconds
===> Running optimization with exaggeration=1.50, lr=0.10 for 250 iterations...
Iteration 50, KL divergence 217642.0613, 50 iterations in 0.2180 sec
Iteration 100, KL divergence 215870.9300, 50 iterations in 0.2250 sec
Iteration 150, KL divergence 214806.1101, 50 iterations in 0.2249 sec
Iteration 200, KL divergence 214103.7596, 50 iterations in 0.2266 sec
Iteration 250, KL divergence 213600.9587, 50 iterations in 0.2223 sec
--> Time elapsed: 1.12 seconds
fig, ax = plt.subplots(figsize=(8, 8))
utils.plot(embedding_train_default, y_train, alpha=0.25, ax=ax)
utils.plot(embedding_test_default, y_test, alpha=0.75, ax=ax)
ax.set_title('perplexity = 30, early_exaggeration = 12')
Text(0.5, 1.0, 'perplexity = 30, early_exaggeration = 12')
fig, ax = plt.subplots(figsize=(8, 8))
utils.plot(embedding_train_default2, y_train, alpha=0.25, ax=ax)
utils.plot(embedding_test_default2, y_test, alpha=0.75, ax=ax)
ax.set_title('perplexity = 30, early_exaggeration = 1')
Text(0.5, 1.0, 'perplexity = 30, early_exaggeration = 1')
import pandas as pd
kl_train_1 = embedding_train_default.kl_divergence
kl_test_1 = embedding_test_default.kl_divergence
kl_train_2 = embedding_train_default2.kl_divergence
kl_test_2 = embedding_test_default2.kl_divergence
df_kl = pd.DataFrame({
'KL (train)': [kl_train_1, kl_train_2],
'KL (test)': [kl_test_1, kl_test_2]
}, index=['Embedding 1', 'Embedding 2'])
df_kl
|  | KL (train) | KL (test) |
|---|---|---|
| Embedding 1 | 2.589916 | 206638.286184 |
| Embedding 2 | 2.624046 | 207510.810251 |
The early_exaggeration factor is typically applied during the initial phase of the optimization. It essentially amplifies the attractive forces between points, letting them move more freely and find their nearest neighbors more easily.
Increasing this value tends to produce more clearly separated clusters. The effect is somewhat hard to discern visually, but there are hints of it here: with early_exaggeration = 1, for example, class 2 ends up split into several fragments rather than forming a single cluster.
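One rough way to back up this visual impression is to count how many separate groups the points of a single class form in each embedding, for instance by clustering them with DBSCAN. This is only an illustrative sketch: the function is ours, and the eps and min_samples values are guesses that would need tuning for a real embedding.
import numpy as np
from sklearn.cluster import DBSCAN

def count_fragments(embedding, labels, cls, eps=3.0, min_samples=20):
    # Cluster the embedded points of one class; each DBSCAN cluster is one "fragment"
    pts = np.asarray(embedding)[labels == cls]
    db = DBSCAN(eps=eps, min_samples=min_samples).fit(pts)
    return len(set(db.labels_) - {-1})  # -1 is DBSCAN's noise label

# Fewer fragments for class '2' would support the observation above
print(count_fragments(embedding_train_default, y_train, '2'))   # early_exaggeration = 12
print(count_fragments(embedding_train_default2, y_train, '2'))  # early_exaggeration = 1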
tsne3 = TSNE(perplexity = 30, metric = 'euclidean', early_exaggeration = 12, random_state = 42, n_jobs = 8, verbose = True, initialization = 'random')
embedding_train_default3 = tsne3.fit(x_train)
embedding_test_default3 = embedding_train_default3.transform(x_test)
--------------------------------------------------------------------------------
TSNE(early_exaggeration=12, initialization='random', n_jobs=8, random_state=42,
verbose=True)
--------------------------------------------------------------------------------
===> Finding 90 nearest neighbors using Annoy approximate search using euclidean distance...
--> Time elapsed: 15.91 seconds
===> Calculating affinity matrix...
--> Time elapsed: 0.43 seconds
===> Running optimization with exaggeration=12.00, lr=3750.00 for 250 iterations...
Iteration 50, KL divergence 7.0629, 50 iterations in 1.3290 sec
Iteration 100, KL divergence 5.6295, 50 iterations in 1.3935 sec
Iteration 150, KL divergence 5.5525, 50 iterations in 1.3452 sec
Iteration 200, KL divergence 5.5316, 50 iterations in 1.3288 sec
Iteration 250, KL divergence 5.5247, 50 iterations in 1.4411 sec
--> Time elapsed: 6.84 seconds
===> Running optimization with exaggeration=1.00, lr=45000.00 for 500 iterations...
Iteration 50, KL divergence 3.2632, 50 iterations in 1.5996 sec
Iteration 100, KL divergence 3.0169, 50 iterations in 2.8510 sec
Iteration 150, KL divergence 2.8940, 50 iterations in 4.6382 sec
Iteration 200, KL divergence 2.8147, 50 iterations in 5.5003 sec
Iteration 250, KL divergence 2.7574, 50 iterations in 6.6277 sec
Iteration 300, KL divergence 2.7138, 50 iterations in 7.7444 sec
Iteration 350, KL divergence 2.6785, 50 iterations in 9.1228 sec
Iteration 400, KL divergence 2.6497, 50 iterations in 10.2350 sec
Iteration 450, KL divergence 2.6252, 50 iterations in 11.1869 sec
Iteration 500, KL divergence 2.6048, 50 iterations in 12.8381 sec
--> Time elapsed: 72.35 seconds
===> Finding 15 nearest neighbors in existing embedding using Annoy approximate search...
--> Time elapsed: 1.90 seconds
===> Calculating affinity matrix...
--> Time elapsed: 0.02 seconds
===> Running optimization with exaggeration=4.00, lr=0.10 for 0 iterations...
--> Time elapsed: 0.00 seconds
===> Running optimization with exaggeration=1.50, lr=0.10 for 250 iterations...
Iteration 50, KL divergence 216671.2177, 50 iterations in 0.2300 sec
Iteration 100, KL divergence 214961.4632, 50 iterations in 0.2420 sec
Iteration 150, KL divergence 213965.0271, 50 iterations in 0.2330 sec
Iteration 200, KL divergence 213305.9178, 50 iterations in 0.2178 sec
Iteration 250, KL divergence 212823.7289, 50 iterations in 0.2146 sec
--> Time elapsed: 1.14 seconds
fig, ax = plt.subplots(figsize=(8, 8))
utils.plot(embedding_train_default, y_train, alpha=0.25, ax=ax)
utils.plot(embedding_test_default, y_test, alpha=0.75, ax=ax)
ax.set_title('perplexity = 30, early_exaggeration = 12, initialization = pca')
Text(0.5, 1.0, 'perplexity = 30, early_exaggeration = 12, initialization = pca')
fig, ax = plt.subplots(figsize=(8, 8))
utils.plot(embedding_train_default3, y_train, alpha=0.25, ax=ax)
utils.plot(embedding_test_default3, y_test, alpha=0.75, ax=ax)
ax.set_title('perplexity = 30, early_exaggeration = 12, initialization = random')
Text(0.5, 1.0, 'perplexity = 30, early_exaggeration = 12, initialization = random')
import pandas as pd
kl_train_1 = embedding_train_default.kl_divergence
kl_test_1 = embedding_test_default.kl_divergence
kl_train_3 = embedding_train_default3.kl_divergence
kl_test_3 = embedding_test_default3.kl_divergence
df_kl = pd.DataFrame({
'KL (train)': [kl_train_1, kl_train_3],
'KL (test)': [kl_test_1, kl_test_3]
}, index=['Embedding 1', 'Embedding 3'])
df_kl
|  | KL (train) | KL (test) |
|---|---|---|
| Embedding 1 | 2.589916 | 206638.286184 |
| Embedding 3 | 2.604412 | 206733.721094 |
The Kullback-Leibler divergences are quite similar, with PCA initialization obtaining a slightly lower (better) value. This is far from conclusive, though, since randomness plays a large role here and merely changing the random_state would change our results.
Random initialization takes slightly longer, since we start from a random arrangement of points in the embedding space. This leads to a different convergence path than with PCA initialization, where the initial points are already laid out according to the structure of the input data.
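A quick way to check how seed-dependent this comparison is would be to refit the random-initialization model with a few different seeds and compare the final KL divergences. A minimal sketch, run on a subsample to keep it fast; the loop and variable names are ours.
# Illustrative robustness check: vary only random_state and compare KL values
for seed in (0, 1, 2):
    t = TSNE(perplexity=30, initialization='random', random_state=seed, n_jobs=-1)
    emb = t.fit(x_train[:5000])  # subsample for speed
    print(f"seed={seed}: KL = {emb.kl_divergence:.4f}")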
tsne4 = TSNE(perplexity = 1, metric = 'euclidean', early_exaggeration = 12, random_state = 42, n_jobs = 8, verbose = True,
initialization = 'pca')
tsne5 = TSNE(perplexity = 100, metric = 'euclidean', early_exaggeration = 12, random_state = 42, n_jobs = 8, verbose = True,
initialization = 'pca')
embedding_train_default4 = tsne4.fit(x_train)
embedding_test_default4 = embedding_train_default4.transform(x_test)
--------------------------------------------------------------------------------
TSNE(early_exaggeration=12, n_jobs=8, perplexity=1, random_state=42,
verbose=True)
--------------------------------------------------------------------------------
===> Finding 3 nearest neighbors using Annoy approximate search using euclidean distance...
--> Time elapsed: 8.20 seconds
===> Calculating affinity matrix...
--> Time elapsed: 0.02 seconds
===> Calculating PCA-based initialization...
--> Time elapsed: 0.63 seconds
===> Running optimization with exaggeration=12.00, lr=3750.00 for 250 iterations...
Iteration 50, KL divergence 8.0306, 50 iterations in 1.0812 sec
Iteration 100, KL divergence 7.2541, 50 iterations in 1.0988 sec
Iteration 150, KL divergence 6.8804, 50 iterations in 1.0940 sec
Iteration 200, KL divergence 6.6425, 50 iterations in 1.0825 sec
Iteration 250, KL divergence 6.4701, 50 iterations in 1.0576 sec
--> Time elapsed: 5.41 seconds
===> Running optimization with exaggeration=1.00, lr=45000.00 for 500 iterations...
Iteration 50, KL divergence 5.0642, 50 iterations in 1.3456 sec
Iteration 100, KL divergence 4.5599, 50 iterations in 2.4179 sec
Iteration 150, KL divergence 4.2578, 50 iterations in 3.7085 sec
Iteration 200, KL divergence 4.0420, 50 iterations in 4.7750 sec
Iteration 250, KL divergence 3.8749, 50 iterations in 6.0586 sec
Iteration 300, KL divergence 3.7386, 50 iterations in 7.1351 sec
Iteration 350, KL divergence 3.6243, 50 iterations in 7.8254 sec
Iteration 400, KL divergence 3.5257, 50 iterations in 9.1333 sec
Iteration 450, KL divergence 3.4398, 50 iterations in 10.3246 sec
Iteration 500, KL divergence 3.3625, 50 iterations in 11.2153 sec
--> Time elapsed: 63.94 seconds
===> Finding 15 nearest neighbors in existing embedding using Annoy approximate search...
--> Time elapsed: 1.89 seconds
===> Calculating affinity matrix...
--> Time elapsed: 0.02 seconds
===> Running optimization with exaggeration=4.00, lr=0.10 for 0 iterations...
--> Time elapsed: 0.00 seconds
===> Running optimization with exaggeration=1.50, lr=0.10 for 250 iterations...
Iteration 50, KL divergence 230280.3691, 50 iterations in 0.2020 sec
Iteration 100, KL divergence 228329.7409, 50 iterations in 0.2182 sec
Iteration 150, KL divergence 226994.9016, 50 iterations in 0.2317 sec
Iteration 200, KL divergence 226028.1811, 50 iterations in 0.2295 sec
Iteration 250, KL divergence 225261.7174, 50 iterations in 0.2176 sec
--> Time elapsed: 1.10 seconds
embedding_train_default5 = tsne5.fit(x_train)
embedding_test_default5 = embedding_train_default5.transform(x_test)
--------------------------------------------------------------------------------
TSNE(early_exaggeration=12, n_jobs=8, perplexity=100, random_state=42,
verbose=True)
--------------------------------------------------------------------------------
===> Finding 300 nearest neighbors using Annoy approximate search using euclidean distance...
--> Time elapsed: 26.37 seconds
===> Calculating affinity matrix...
--> Time elapsed: 1.55 seconds
===> Calculating PCA-based initialization...
--> Time elapsed: 0.63 seconds
===> Running optimization with exaggeration=12.00, lr=3750.00 for 250 iterations...
Iteration 50, KL divergence 4.9143, 50 iterations in 1.9509 sec
Iteration 100, KL divergence 4.9994, 50 iterations in 2.0278 sec
Iteration 150, KL divergence 5.0014, 50 iterations in 1.9024 sec
Iteration 200, KL divergence 5.0013, 50 iterations in 1.9296 sec
Iteration 250, KL divergence 5.0013, 50 iterations in 1.8109 sec
--> Time elapsed: 9.62 seconds
===> Running optimization with exaggeration=1.00, lr=45000.00 for 500 iterations...
Iteration 50, KL divergence 2.5421, 50 iterations in 1.9706 sec
Iteration 100, KL divergence 2.3885, 50 iterations in 2.9697 sec
Iteration 150, KL divergence 2.3220, 50 iterations in 4.0773 sec
Iteration 200, KL divergence 2.2834, 50 iterations in 5.1269 sec
Iteration 250, KL divergence 2.2561, 50 iterations in 5.6893 sec
Iteration 300, KL divergence 2.2365, 50 iterations in 6.6018 sec
Iteration 350, KL divergence 2.2200, 50 iterations in 7.2386 sec
Iteration 400, KL divergence 2.2073, 50 iterations in 8.1607 sec
Iteration 450, KL divergence 2.1962, 50 iterations in 8.5898 sec
Iteration 500, KL divergence 2.1879, 50 iterations in 9.1279 sec
--> Time elapsed: 59.55 seconds
===> Finding 15 nearest neighbors in existing embedding using Annoy approximate search...
--> Time elapsed: 2.14 seconds
===> Calculating affinity matrix...
--> Time elapsed: 0.02 seconds
===> Running optimization with exaggeration=4.00, lr=0.10 for 0 iterations...
--> Time elapsed: 0.00 seconds
===> Running optimization with exaggeration=1.50, lr=0.10 for 250 iterations...
Iteration 50, KL divergence 217079.8194, 50 iterations in 0.2080 sec
Iteration 100, KL divergence 215606.6905, 50 iterations in 0.2101 sec
Iteration 150, KL divergence 214807.1805, 50 iterations in 0.1970 sec
Iteration 200, KL divergence 214280.5428, 50 iterations in 0.2190 sec
Iteration 250, KL divergence 213899.1103, 50 iterations in 0.2300 sec
--> Time elapsed: 1.06 seconds
fig, ax = plt.subplots(figsize=(8, 8))
utils.plot(embedding_train_default4, y_train, alpha=0.25, ax=ax)
utils.plot(embedding_test_default4, y_test, alpha=0.75, ax=ax)
ax.set_title('perplexity = 1, early_exaggeration = 12')
Text(0.5, 1.0, 'perplexity = 1, early_exaggeration = 12')
fig, ax = plt.subplots(figsize=(8, 8))
utils.plot(embedding_train_default5, y_train, alpha=0.25, ax=ax)
utils.plot(embedding_test_default5, y_test, alpha=0.75, ax=ax)
ax.set_title('perplexity = 100, early_exaggeration = 12')
Text(0.5, 1.0, 'perplexity = 100, early_exaggeration = 12')
When we run the t-SNE model with very low perplexity values, such as 1, and very high ones, such as 100, we get markedly different results because of how perplexity affects the t-SNE optimization process.
Perplexity 1: With such a low perplexity, the model cannot adequately capture the global structure of the data. This is because perplexity controls the number of neighbors considered when computing the probability distributions. With a perplexity of 1, only the very nearest neighbors are considered, which yields an extremely local representation of the data. This can lead to poor groupings and a misleading picture of the data's structure in the lower-dimensional space.
Perplexity 100: Conversely, with such a high perplexity, the model can better capture the global structure of the data by considering more neighbors when computing the probability distributions. However, this can oversimplify the local structure, producing a representation in which local relationships are lost in favor of the global structure.
In short, with extreme perplexity values such as 1 and 100 we see problems like failing to capture the global structure of the data or losing local detail. Perplexity must be tuned appropriately to strike a balance between the global and local representation of the data.
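One way to make the local-structure half of this claim measurable is a neighborhood-preservation score: the fraction of each point's k nearest neighbors in the original space that remain among its k nearest neighbors in the embedding. A sketch under our own assumptions (function name and parameters are illustrative; it runs on a random subsample for speed):
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_preservation(X_high, X_low, k=10, n=2000, seed=0):
    # Compare k-NN sets in the original space vs. the embedding on a subsample
    idx = np.random.RandomState(seed).choice(len(X_high), n, replace=False)
    nn_high = NearestNeighbors(n_neighbors=k).fit(X_high[idx]).kneighbors(return_distance=False)
    nn_low = NearestNeighbors(n_neighbors=k).fit(np.asarray(X_low)[idx]).kneighbors(return_distance=False)
    return np.mean([len(set(a) & set(b)) / k for a, b in zip(nn_high, nn_low)])

# Higher means better local-neighborhood preservation
print(knn_preservation(x_train, embedding_train_default4))  # perplexity = 1
print(knn_preservation(x_train, embedding_train_default5))  # perplexity = 100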
fig, ax = plt.subplots(figsize=(8, 8))
utils.plot(embedding_train_default, y_train, alpha=0.25, ax=ax)
utils.plot(embedding_test_default, y_test, alpha=0.75, ax=ax)
ax.set_title('perplexity = 30, early_exaggeration = 12')
Text(0.5, 1.0, 'perplexity = 30, early_exaggeration = 12')
The main idea behind t-SNE is to preserve both the local and the global structure of the data during dimensionality reduction. It works by computing a joint probability distribution over pairs of points in the original space, and a similar probability distribution in the lower-dimensional space.
It then adjusts the points in the lower-dimensional space to minimize the divergence between these two probability distributions, typically the Kullback-Leibler divergence.
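For reference, the standard formulation (van der Maaten and Hinton, 2008): the high-dimensional affinities use Gaussian kernels whose bandwidths $\sigma_i$ are chosen so that each conditional distribution has the requested perplexity, while the low-dimensional affinities use a heavy-tailed Student-t kernel:

$$p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}, \qquad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}$$

$$q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}, \qquad C = \mathrm{KL}(P \parallel Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$

The perplexity parameter fixes $2^{H(P_i)}$, where $H(P_i)$ is the Shannon entropy of the conditional distribution $p_{\cdot|i}$, which is why it behaves like a smooth effective number of neighbors.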
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
colors = ['royalblue','red','deeppink', 'maroon', 'mediumorchid', 'tan', 'forestgreen', 'olive', 'goldenrod', 'lightcyan', 'navy']
vectorizer = np.vectorize(lambda x: colors[x % len(colors)])
from sklearn.datasets import make_circles
X, y = make_circles(n_samples=200, noise=0.01, random_state = 42)
plt.scatter(X[:,0], X[:,1],c=vectorizer(y))
<matplotlib.collections.PathCollection at 0x1b7ce0aac90>
For more information on TSNE in sklearn, see the following link:
https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html
- perplexity: related to the number of nearest neighbors used in manifold learning algorithms. Larger datasets usually require a larger value.
- early_exaggeration: controls how tight the natural clusters of the original space are in the embedded space and how much space there will be between them. Larger values mean more space between the natural clusters in the embedded space.
def visualize_tsne(X, y, perplexities):
    plt.figure(figsize=(18, 6))
    num_perplexities = len(perplexities)
    kl_divergences = {}  # Dictionary to store the KL divergences
    for i, perplexity in enumerate(perplexities, start=1):
        tsne = TSNE(n_components=2, perplexity=perplexity, random_state=42)
        X_tsne = tsne.fit_transform(X)
        plt.subplot(1, num_perplexities, i)
        unique_classes = np.unique(y)
        for cls in unique_classes:
            indices = np.where(y == cls)
            plt.scatter(X_tsne[indices, 0], X_tsne[indices, 1], label=f'Class {cls}', alpha=0.8)
        plt.title(f"Perplexity = {perplexity}")
        plt.xlabel("t-SNE Component 1")
        plt.ylabel("t-SNE Component 2")
        plt.legend()
        plt.grid(True)
        kl_divergences[perplexity] = tsne.kl_divergence_
    plt.tight_layout()
    plt.show()
    return kl_divergences
perplexities = [5, 30, 100]
kl_divergences = visualize_tsne(X, y, perplexities)
c:\Users\jesus\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\manifold\_t_sne.py:800: FutureWarning: The default initialization in TSNE will change from 'random' to 'pca' in 1.2. warnings.warn(
c:\Users\jesus\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\manifold\_t_sne.py:810: FutureWarning: The default learning rate in TSNE will change from 200.0 to 'auto' in 1.2. warnings.warn(
We can see that with a low perplexity value things do not go well and the circular shape disappears completely.
With a perplexity of 30 we still see strange artifacts in some areas, whereas with a perplexity of 100 everything looks as it should.
print("KL Divergences:", kl_divergences)
KL Divergences: {5: 0.4045565128326416, 30: 0.19363267719745636, 100: 0.07120548188686371}
We can see that increasing the perplexity yields progressively smaller Kullback-Leibler divergences, indicating that the discrepancy between the probability distribution of the data in the original space and in the reduced space keeps shrinking. Note, however, that each perplexity defines a different input distribution P, so these KL values are not strictly comparable across perplexities.
import time

def run_tsne(X, perplexity, method):
    start_time = time.time()
    tsne = TSNE(n_components = 2, perplexity = perplexity, method = method, random_state = 42)
    X_tsne = tsne.fit_transform(X)
    end_time = time.time()
    execution_time = end_time - start_time
    return X_tsne, execution_time
perplexity = 100
# Barnes-Hut method
X_tsne_bh, execution_time_bh = run_tsne(X, perplexity, method = 'barnes_hut')
# Exact method
X_tsne_exact, execution_time_exact = run_tsne(X, perplexity, method = 'exact')
print(f"Execution time using Barnes-Hut: {execution_time_bh} seconds")
print(f"Execution time using the exact method: {execution_time_exact} seconds")
c:\Users\jesus\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\manifold\_t_sne.py:800: FutureWarning: The default initialization in TSNE will change from 'random' to 'pca' in 1.2. warnings.warn(
c:\Users\jesus\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\manifold\_t_sne.py:810: FutureWarning: The default learning rate in TSNE will change from 200.0 to 'auto' in 1.2. warnings.warn(
Execution time using Barnes-Hut: 0.28653502464294434 seconds
Execution time using the exact method: 0.4057042598724365 seconds
We can see that with the Barnes-Hut method the computation takes somewhat less time.
The main idea behind it is that, instead of computing the interaction of each point with every other point in the low-dimensional embedding space, we approximate the influence of groups of distant points by their centroid. The space is partitioned into "cells", so instead of computing every pairwise similarity, the interactions are computed per cell.
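A toy numerical illustration of the centroid idea (this is not the actual openTSNE/sklearn internals; the numbers, the single "cell", and the kernel choice are just for demonstration):
import numpy as np

rng = np.random.default_rng(0)
query = np.array([0.0, 0.0])
# A tight, distant group of points playing the role of one Barnes-Hut "cell"
cell = rng.normal(loc=[50.0, 50.0], scale=1.0, size=(1000, 2))

# Exact: sum the Student-t kernel (1 + ||d||^2)^-1 over every point in the cell
exact = np.sum(1.0 / (1.0 + np.sum((cell - query) ** 2, axis=1)))

# Barnes-Hut-style approximation: one evaluation at the centroid, times the cell size
centroid = cell.mean(axis=0)
approx = len(cell) / (1.0 + np.sum((centroid - query) ** 2))

print(f"exact = {exact:.6f}, centroid approximation = {approx:.6f}")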